Add SSE watchdog and improve connection error handling #20
Merged
Adds a watchdog thread that monitors SSE connection health by tracking when data (including keepalives) is received. If no data is received for 120 seconds (configurable), the watchdog:

1. Logs a warning
2. Polls the checkpoint API for fresh config data
3. Closes the SSE client to force reconnection

This helps detect and recover from stuck SSE connections that may not trigger normal timeout/error handling (e.g., proxy issues, half-open connections).

Additional improvements:

- Changed except Exception to except BaseException to catch GeneratorExit and other BaseException subclasses that could silently kill the thread
- Added logging when streaming loop exits (with shutdown reason)
- Fixed backoff logging to show actual sleep time instead of pre-doubled value
- Removed dead code (ConfigSDK.sse_client was never assigned)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
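The watchdog behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the SDK's actual `_sse_watchdog.py`: the class name `SSEWatchdog`, its methods, the `on_stall` callback, and the injectable `clock` are all assumptions made for the example. In the real change, the stall handler is what logs the warning, polls the checkpoint API, and closes the SSE client; here that is left to whatever callback the caller supplies.

```python
import threading
import time
from typing import Callable, Optional


class SSEWatchdog:
    """Hypothetical sketch of a stall detector; names are illustrative only."""

    def __init__(
        self,
        timeout_s: float = 120.0,
        on_stall: Optional[Callable[[], None]] = None,
        check_interval_s: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout_s = timeout_s
        self._on_stall = on_stall or (lambda: None)
        self._check_interval_s = check_interval_s
        self._clock = clock
        self._last_seen = clock()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def touch(self) -> None:
        """Record that data (an event or a keepalive) just arrived."""
        with self._lock:
            self._last_seen = self._clock()

    def is_stalled(self) -> bool:
        with self._lock:
            return self._clock() - self._last_seen >= self._timeout_s

    def check_once(self) -> bool:
        """Run one check; invoke the recovery callback if the stream is stalled."""
        if not self.is_stalled():
            return False
        self._on_stall()
        # Reset the timer so recovery is not retriggered on every tick.
        self.touch()
        return True

    def start(self) -> None:
        self._thread = threading.Thread(
            target=self._run, name="sse-watchdog", daemon=True
        )
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()

    def _run(self) -> None:
        # Event.wait returns False on timeout, True once stop() is called.
        while not self._stop.wait(self._check_interval_s):
            self.check_once()
```

In this sketch the streaming loop would call `touch()` for every received line, keepalives included, so only a truly silent connection reaches the timeout. Using a monotonic clock avoids false alarms when the wall clock is adjusted.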
The previous code caught UnauthorizedException, which is never raised by raise_for_status(). Instead, HTTPError is raised. This change:

- Catches HTTPError and inspects response.status_code for 401/403
- Removes dead UnauthorizedException catch block
- Adds specific tests for 401 and 403 responses

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
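A rough sketch of the pattern this commit describes is shown below. The commit does not name the HTTP library, so the `Response` and `HTTPError` classes here are local stand-ins that mimic the common shape (as in `requests`, where `raise_for_status()` raises an `HTTPError` whose `response` attribute carries `status_code`). The function `fetch_checkpoint` and the `on_unauthorized` callback are hypothetical names, not the SDK's API.

```python
from dataclasses import dataclass
from typing import Callable


class HTTPError(Exception):
    """Stand-in for the HTTP library's HTTPError (assumption for this sketch)."""

    def __init__(self, response: "Response") -> None:
        super().__init__(f"HTTP {response.status_code}")
        self.response = response


@dataclass
class Response:
    """Stand-in response object exposing status_code and raise_for_status()."""

    status_code: int

    def raise_for_status(self) -> None:
        if self.status_code >= 400:
            raise HTTPError(self)


def fetch_checkpoint(
    send: Callable[[], Response],
    on_unauthorized: Callable[[int], None],
) -> bool:
    """Return True on success, False if the request was rejected as 401/403."""
    try:
        response = send()
        response.raise_for_status()
    except HTTPError as exc:
        # raise_for_status() signals every 4xx/5xx this way, so the status
        # code has to be inspected to single out authorization failures.
        if exc.response.status_code in (401, 403):
            on_unauthorized(exc.response.status_code)
            return False
        raise  # other HTTP errors are left to the caller's retry logic
    return True
```

Re-raising non-authorization errors keeps them on the normal retry path, while 401/403 are routed to a dedicated handler because retrying with the same credentials would not help.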
If load_checkpoint() fails (no data found or unexpected exception), streaming would never start because finish_init() was never called. This fix starts streaming as a fallback when checkpoint loading fails, but does NOT call finish_init(); this preserves the timeout behavior where get() blocks until timeout if no data is available.

- Start streaming when CDN and cache both fail to load
- Start streaming when unexpected exception occurs
- Do NOT start streaming on UnauthorizedException (handled separately)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
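The startup rules in this commit can be illustrated with a small sketch. Everything here is an assumption made for the example: the class `StartupSketch`, the `UnauthorizedError` stand-in, and the exact split of work between `initialize()` and `finish_init()` are not taken from `config_sdk.py`. The point is only to show the three outcomes: success marks the client ready and streams, failure streams without marking ready, and an unauthorized result does neither.

```python
import threading
from typing import Callable, Dict, Optional


class UnauthorizedError(Exception):
    """Stand-in for the SDK's unauthorized signal (hypothetical name)."""


class StartupSketch:
    def __init__(
        self,
        load_checkpoint: Callable[[], Optional[Dict[str, str]]],
        start_streaming: Callable[[], None],
    ) -> None:
        self._load_checkpoint = load_checkpoint
        self._start_streaming = start_streaming
        self._ready = threading.Event()
        self._configs: Dict[str, str] = {}
        self.streaming_started = False

    def _ensure_streaming(self) -> None:
        # Guard so streaming is started at most once.
        if not self.streaming_started:
            self.streaming_started = True
            self._start_streaming()

    def finish_init(self) -> None:
        # Only called once data is available: unblocks get() and streams.
        self._ready.set()
        self._ensure_streaming()

    def initialize(self) -> None:
        try:
            data = self._load_checkpoint()
        except UnauthorizedError:
            return  # handled separately; streaming is not started
        except Exception:
            data = None  # unexpected failure is treated like "no data found"
        if data is None:
            # Fallback: stream anyway, but leave the client "not ready" so
            # get() keeps blocking until data arrives or the timeout expires.
            self._ensure_streaming()
            return
        self._configs.update(data)
        self.finish_init()

    def get(self, key: str, timeout_s: float) -> str:
        if not self._ready.wait(timeout_s):
            raise TimeoutError("no config data available before timeout")
        return self._configs[key]
```

In a fuller version, the streaming side would call `finish_init()` itself once the first SSE payload is applied, which is how a client that failed its checkpoint load could still become ready later.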
- Bump version to 1.2.0 for SSE watchdog and error handling improvements
- Add dev_runner.py for observing SDK behavior during development

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Poetry 2.x installation was failing in GitHub Actions. Pin to 1.8.5 for stability. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
This PR addresses issues where customers were reporting stale configuration data. Investigation revealed several failure modes where the SSE streaming connection could silently fail or never start, leaving clients stuck with outdated configs.
Changes
- SSE Watchdog: New monitoring thread that detects stuck SSE connections by tracking keepalive activity. If no data is received for 120 seconds (4 missed 30s keepalives), it triggers recovery by polling for fresh data and forcing SSE reconnection.
- Fixed 401/403 handling: The previous code caught `UnauthorizedException`, which is never raised by `raise_for_status()`. Now properly catches `HTTPError` and inspects `response.status_code` for 401/403.
- Fixed silent loop exits: Changed `except Exception` to `except BaseException` and added `finally` block logging to detect when the streaming loop exits unexpectedly.
- Fixed streaming startup on checkpoint failure: If checkpoint loading fails (CDN down, unexpected exception), streaming now starts as a fallback so SSE can potentially load configs. Preserves timeout behavior for `get()` calls.
- Dev runner script: Added `dev_runner.py` for observing SDK behavior during development.

Files Changed
- `sdk_reforge/_sse_watchdog.py`
- `sdk_reforge/_sse_connection_manager.py`
- `sdk_reforge/config_sdk.py`
- `tests/test_sse_watchdog.py`
- `tests/test_sse_connection_manager.py`
- `tests/test_config_sdk.py`
- `dev_runner.py`

Test plan
- `handle_unauthorized_response` is called
- `dev_runner.py` to observe SSE connection and watchdog behavior

🤖 Generated with Claude Code